Reproducible and clinically translatable deep neural networks for cervical screening

Cervical cancer is a leading cause of cancer mortality, with approximately 90% of the 250,000 deaths per year occurring in low- and middle-income countries (LMIC). Secondary prevention with cervical screening involves detecting and treating precursor lesions; however, scaling screening efforts in LMIC has been hampered by infrastructure and cost constraints. Recent work has supported the development of an artificial intelligence (AI) pipeline on digital images of the cervix to achieve an accurate and reliable diagnosis of treatable precancerous lesions. In particular, WHO guidelines emphasize visual triage of women testing positive for human papillomavirus (HPV) as the primary screen, and AI could assist in this triage task. In this work, we implemented a comprehensive deep-learning model selection and optimization study on a large, collated, multi-geography, multi-institution, and multi-device dataset of 9,462 women (17,013 images). We evaluated relative portability, repeatability, and classification performance. The top performing model, when combined with HPV type, achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 0.89 within our study population of interest, and a limited total extreme misclassification rate of 3.4%, on held-aside test sets. Our model also produced reliable and consistent predictions, achieving a strong quadratic weighted kappa (QWK) of 0.86 and a minimal % 2-class disagreement (% 2-Cl. D.) of 0.69% between image pairs across women. Our work is among the first efforts at designing a robust, repeatable, accurate and clinically translatable deep-learning model for cervical screening.

detection or classification to more nuanced versions with direct relevance for risk stratification of patients and precision medicine 6.
The advancements made by AI in image classification tasks over the past several years have also reached the cervical imaging domain, for instance, as an assistive technology for cervical screening 7. Globally, cervical cancer is a leading cause of cancer morbidity and mortality, with approximately 90% of the 250,000 deaths per year occurring in low- and middle-income countries (LMIC) 8,9. Persistent infections with high-risk human papillomavirus (HPV) types are the causal risk factor for subsequent carcinogenesis 10,11. Accordingly, primary prevention via prophylactic HPV vaccination 12, and secondary prevention via HPV-based screening for precursor lesions ("precancer") are the recommended preventive methods 13,14. Crucially, screening is the key secondary prevention strategy, with the long process of carcinogenic transformation from HPV infection to invasive cancer providing an opportunity for detecting the disease at a stage when treatment is preventive or, at least, curative 13.
However, implementation of an effective cervical screening program in LMIC, in line with WHO's elimination targets 15, is hindered by barriers to healthcare delivery. Cytology and other current tests are costly and have substantial infrastructure requirements due to the need for laboratory infrastructure, transport of samples, multiple visits for screening and treatment, and (in the case of cytology) highly trained cytopathologists and colposcopists for management of abnormal results 16. As a less resource-intensive alternative, some have established screening of the cervix by visual inspection after application of acetic acid (VIA) to identify precancerous or cancerous abnormalities via community-based programs, followed by treatment of abnormal lesions using thermal ablation or cryotherapy and/or large loop excision of the transformation zone (LLETZ) 17,18. The major limitation of VIA, however, is its inherently subjective and unreliable nature, resulting in high variability in the ability of clinicians to differentiate precancer from more common minor abnormalities, which leads to both undertreatment and overtreatment 19,20.
Given the severe burden of cervical cancer and the lack of widely disseminated screening approaches in LMIC, a critical need exists for methods that can more consistently, inexpensively, and accurately evaluate cervical lesions and subsequently enable informed local choice of the appropriate treatment protocols.
There has been a relative paucity of prior work utilizing AI and DL for cervical screening based on cervical images. Crucially, the existing work also largely suffers from overfitting of the model on the training data. This leads to apparent initial promise, with either poor performance on or absence of held-aside test sets for evaluating true model performance. When deployed in different settings, these models fail to return consistent scores and accurately detect precancers [21][22][23][24]. This poses significant concerns when considering downstream deployment in various LMIC, where model predictions directly inform the course of treatment, and where screening opportunities are limited.
In this work, we address the aforementioned concerns through three contributions, which are generalizable to clinical domains outside of cervical imaging:

Improved reliability of model predictions
We employ a comprehensive, multi-level model design approach with a primary aim of improving model reliability. Model reliability, or repeatability, is defined as the ability of a model to generate near-identical predictions for the same woman under identical conditions, ensuring that the model produces precise, reliable outputs in the clinical setting. Specifically, we consider multiple combinations of model architectures, loss functions, balancing strategies, and dropout. Our final model selection for the classifier, termed automated visual evaluation (AVE), is based on a criterion that first prioritizes model reliability, followed by class discrimination or classification performance, and finally reduction of grave errors.

Improved clinical translatability: multi-level ground truth
The large majority of current medical image classification and radiogenomic pipelines that utilize AI and DL, across clinical domains, use binary ground truths. Our clinical intuition from working with binary models, as well as prior empirical work, has informed us that these models frequently fail to capture the inherent uncertainty with ambiguous samples [21][22][23][24]. These uncertain samples are of two intersecting kinds: samples that are uncertain to the clinician ("rater uncertainty") and samples that are uncertain to the model, i.e., where the model reports low confidence scores ("model uncertainty"); both instances can lead to incorrect classification and subsequent misinformed downstream actions for these patients. Crucially, real-world clinical oncology samples, across domains such as cervical, prostate and breast, and across hospitals/institutions, include many uncertain cases [25][26][27]. To address both levels of ambiguity, we employ several multi-level, ordinal ground truth delineation schemes in our model selection.

Improved downstream clinical-decision making: combination of HPV risk stratification with model predictions
A number of different cancers have identified "sufficient" causes. Examples across this spectrum range from the presence of BRAF V600E mutation for the papillary subtype of craniopharyngioma 28, to the presence of BRCA1 or BRCA2 mutations for breast cancer [29][30][31]. Cervical cancer is unique among common neoplasms in that HPV is virtually necessary and is present in >95% of cases. Different HPV types predict higher or lower absolute risk, e.g., HPV 16 is the highest risk type, followed by HPV 18, while other types pose weaker or no risk [32][33][34]. In our work, we combined HPV typing and its strong risk stratification with our visual model predictions, to create a risk score that can be adapted to local clinical preferences for "risk-action" thresholds. This is generalizable across clinical domains where additional clinical variables and risk associations significantly determine patient outcomes.

Repeatability analysis
Table 2 highlights the summary of the repeatability analysis (Stage I), reporting the mean, median and adjusted linear regression β values for QWK. We evaluated the metrics overall and within each design choice category, dropping the worst performing design choices both overall and within each category. Overall, this resulted in 19.0% of our design choices being dropped from further consideration (Table 2, shaded in bold; Fig. 3a, muted bars). Within each design choice category, this amounted to dropping the design choices that had adjusted linear regression β values > 0.06 below reference. Specifically, the design choices that were dropped in Stage 1 include the resnest50 architecture, focal and CORAL loss functions, and models trained without dropout. Here, we adopted a conservative approach, choosing to keep design choices that resulted in median QWK and corresponding adjusted β values that are relatively close and not clearly distinguishable from each other, and only dropped the clearly worst performing choices; for instance, we decided to keep both the "3 level subsets" (β = −0.026) and the "5 level all patients" (β = −0.025) design choices within the "Multilevel Ground Truth" design category, and pass them through to Stage 3.

[…] 1, and Supp. Methods for detailed description and breakdown of the studies by ground truth) used to generate the final dataset on the middle panel, which is subsequently used to generate a train and validation set, as well as two separate test sets. The intersections of model selection choices on the bottom panel are used to generate a compendium of models trained using the corresponding train and validation sets and evaluated on the "Model Selection Set"/"Test Set 1", optimizing for repeatability, classification performance, reduced extreme misclassifications and combined risk-stratification with high-risk human papillomavirus (HPV) types. "Test Set 2" is utilized to verify the performance of top candidates that emerge from evaluation on the "Model Selection Set"/"Test Set 1". SWT: Swin Transformer; QWK: quadratic weighted kappa; CORAL: CORAL (consistent rank logits) loss, as described in the "Methods" section.

Classification performance analysis
Table 3 highlights the summary of the classification performance analysis (Stage II), reporting the median and the interquartile ranges for each of our two key classification metrics: (1) Youden's index and (2) extreme misclassifications, as well as the adjusted linear regression β for each design choice. Similar to Stage 1, we evaluated the metrics both overall and within each design choice category, dropping the worst performing design choices at this stage in a two-level approach.
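The two classification metrics named above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes a 3×3 confusion matrix with rows as ground truth and columns as predictions (0 = normal, 1 = gray zone, 2 = precancer+), assumes the per-class extreme misclassification percentages are normalized by the size of the true class, and uses the standard binary definition of Youden's index (sensitivity + specificity − 1). The function names are hypothetical.

```python
import numpy as np

def extreme_misclassification(cm):
    """Extreme (two-level) errors from a 3x3 confusion matrix.

    cm[i, j] = number of images with true class i predicted as class j,
    with classes 0 = normal, 1 = gray zone, 2 = precancer+.
    Assumption: per-class rates are normalized by the true-class total.
    """
    cm = np.asarray(cm, dtype=float)
    p_as_n = cm[2, 0] / cm[2].sum()            # precancer+ predicted as normal
    n_as_p = cm[0, 2] / cm[0].sum()            # normal predicted as precancer+
    total = (cm[2, 0] + cm[0, 2]) / cm.sum()   # both extreme off-diagonals
    return p_as_n, n_as_p, total

def youden_index(tp, fn, tn, fp):
    """Standard binary Youden's index: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0

# Toy counts, for illustration only (not data from the study).
toy_cm = [[80, 15, 5],
          [10, 30, 10],
          [4, 6, 40]]
p_as_n, n_as_p, total = extreme_misclassification(toy_cm)
```

Note that the total rate is computed over all images, so it can be lower than either per-class rate.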
In the first level, we looked at the Youden's index across all design choices and dropped the worst performing choices; this resulted in 3 choices (SWT architecture, no balancing, 5-level ground truth) or 17.6% of the remaining choices being dropped and amounted to dropping choices that had median Youden's index of < 150 (Table 3, shaded in bold; Fig. 3b, muted bars); this was further supported by other design choices within each design choice category having positive adjusted linear regression β values. In the second level, we considered two factors: (1) median extreme misclassification percentages (% precancer+ as normal and % normal as precancer+); and (2) practical reasons, dropping design choices due to a combination of these two factors. This resulted in three balancing strategies (Sampling 1:1:2, 1:1:4 and 2:1:1) and the "3 level subsets" ground truth mapping, or 28.6% of the remaining design choices being dropped (Table 3, shaded in italics). Weighted sampling by using preassigned label weights per class for the loading sampler (such as 1:1:4) is imprecise since weights are not adjusted relative to the dataset-specific class imbalance; this skews the model in making predictions along the lines of the assigned weights. This can be seen among the sampling strategies dropped: sampling 1:1:4 had a high rate of median % normal predicted as precancer+ (27.4%), while sampling 2:1:1 had a high rate of median % precancer+ predicted as normal (24.3%). The "3 level subsets" ground truth mapping was dropped for practical reasons: it was generated from the 5-level map by omitting the GL and GH labels to attempt to generate further distinction or discontinuity between the three classes (normal, GM, precancer+) during model experimentation. Both the "5-level all patients" and the "3-level subsets" ground-truth mappings are impractical due to the limited clinical data (either HPV, histology and/or cytology) we anticipate having available in the field to generate 5 distinct levels of ground truth, thereby rendering retraining, validation and implementation of these approaches challenging.

[…] 4 highlight the 10 best performing models that emerge following Stages 1, 2 and 3 of our model selection approach. All 10 models perform similarly among HPV positive women in the full 5-study set, while showing notable differences per study as shown in the NHS subset of the full 5-study set, measured by the […]

Classification and repeatability analysis: "test set 2"
Figure 5a and Table 5 highlight the additional classification (1. % precancer+ as normal and 2. % normal as precancer+), and repeatability (1. % 2-class disagreement and 2. QWK) metrics from the predictions of each of the top 10 models on "Test Set 2", while Fig. 6 takes a deeper look by comparing individual model predictions across 60 images for these top 10 models on "Test Set 2". The top 10 models that pass through all stages of our model selection approach utilize the following configurations:
• Architecture: densenet121 or resnet50
• Loss function: quadratic weighted kappa (QWK) or cross-entropy (CE)
• Balancing strategy: remove controls or balanced sampling
• Dropout: Monte-Carlo (MC) dropout (spatial)
• Multi-level ground truth: 3 level all patients (Normal, Gray Zone, Precancer+)
• Model type: multiclass classification

Based on the individual performances of the models in terms of degree of extreme misclassifications and repeatability (Table 5, Fig. 5a) and additional risk stratification (Table 4, Fig. 4), our best performing model (# 36) has the smallest rate of overall extreme misclassifications (5.9% precancer+ as normal, 4.2% normal as precancer+), one of the highest repeatability performances (repeatability QWK = 0.8557, 0.69% 2-class disagreement on repeat images across women), and the highest additional risk stratification in the NHS subset of the full 5-study dataset, our screening population (difference between HPV-AVE combined AUC and HPV AUC = 0.164). Among the top 10 models, model # 36 utilizes the following unique design choices:
• Architecture: densenet121
• Loss function: quadratic weighted kappa (QWK)
• Balancing strategy: remove controls

Figure 5b highlights key performance metrics of the top ranked model (# 36) on "Test Set 2", as captured by the corresponding (i) ROC curves, (ii) confusion matrix, (iii) histogram of the model predicted score and (iv) Bland-Altman plot. The ROC curve in (i) demonstrates excellent discrimination of the normal (class 0) and precancer+ (class 2) categories, with corresponding AUROCs of 0.88 (class 0 vs. rest) and 0.82 (class 2 vs. rest) respectively. This is reinforced by the confusion matrix in (ii), which highlights a total extreme misclassification (extreme off diagonals) rate of only 3.4%, and by the histogram in (iii), which illustrates the strong class separation in model predicted score; specifically, (iii) highlights that the model confidently predicts the largest clusters of each of the three ground truth classes correctly, as shown by the peaks around scores 0.0, 1.0 and 2.0. Finally, the Bland-Altman plot in (iv) highlights the model performance in terms of repeatability: each point on this plot refers to a single woman, with the y-axis representing the maximum difference in the score across repeat images per woman, and the x-axis plotting the mean of the corresponding score across all repeat images per woman. Repeatability is evaluated using the 95% limits of agreement (LoA), highlighted by the blue dotted lines in (iv) on either side of the mean (central blue dotted line); for model # 36, the 95% LoA is quite narrow, with most points clustered around 0 on the y-axis, suggesting that score values of the model on repeat images taken on the same visit for each woman are quite similar; here, the 95% LoA adjusted for the number of classes and presented as a fraction of the possible value range is 0.240 (± 0.038).

Figure 6 reinforces the validity of our approach for model selection and optimization by providing a detailed comparison of model performance at the individual image level, with the top models performing desirably with respect to the clinical problem we are aiming to address. Incorporation of a gray zone class, together with MC dropout and loss functions that penalize misclassifications between the extreme classes, ensures that we deal with ambiguity with cases at the class boundaries. For instance, among these randomly selected 60 images, the best performing model (# 36) has the lowest rate of extreme misclassifications (none), while predicting a wide enough gray zone that adequately encapsulates the clinical ambiguity with uncertain cases: these are cases for which even clinically trained colposcopists and gynecologic oncologists would find determination of precancer+ status challenging.
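The repeatability quantities described above (the per-woman points of the Bland-Altman plot, the 95% limits of agreement, and the % 2-class disagreement) can be sketched as below. This is a hedged illustration under stated assumptions: the limits of agreement use the conventional mean ± 1.96 standard deviations of the per-woman differences, the excerpt's adjustment "for the number of classes" is not reproduced because its exact form is not given, and all function names are hypothetical.

```python
import numpy as np

def bland_altman_points(scores_per_woman):
    """Per-woman x/y values as described in the text.

    scores_per_woman: list of 1-D sequences, one per woman, holding the
    continuous model score for each repeat image from the same visit.
    Returns (mean score per woman, max score difference per woman).
    """
    means = np.array([np.mean(s) for s in scores_per_woman])
    diffs = np.array([np.max(s) - np.min(s) for s in scores_per_woman])
    return means, diffs

def limits_of_agreement(diffs):
    """Conventional 95% limits of agreement: mean +/- 1.96 * SD (assumed)."""
    center = diffs.mean()
    half_width = 1.96 * diffs.std(ddof=1)
    return center - half_width, center, center + half_width

def two_class_disagreement(class_a, class_b):
    """Fraction of image pairs whose predicted classes are two levels apart,
    i.e. one image called normal (0) and its pair called precancer+ (2)."""
    a, b = np.asarray(class_a), np.asarray(class_b)
    return float(np.mean(np.abs(a - b) == 2))

# Toy scores for three women (illustrative values only).
toy_scores = [[0.1, 0.2], [1.0, 1.4], [1.9, 2.0, 1.8]]
means, diffs = bland_altman_points(toy_scores)
lower, center, upper = limits_of_agreement(diffs)
```

A narrow interval between `lower` and `upper` corresponds to the tight clustering around 0 described for model # 36.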

Discussion
Despite the advancements made by AI in clinical classification tasks, key concerns hindering model deployment from bench to clinical practice include model reliability and clinical translatability. An incorrect, unreliable, or unrepeatable model prediction has the potential to lead to a cascade of clinical actions that might jeopardize the health and safety of a patient. Therefore, it is essential that models designed with the goal of clinical deployment be specifically optimized for improved repeatability and clinical translation. Our work addresses these concerns of reliability and clinical translatability. We optimize our model selection approach with improved repeatability as the primary stage (Stage I) of our selection criterion, ensuring that only design choices that produce repeatable, reliable predictions across multiple images from the same woman's visit are passed through to the next stage of evaluation for classification performance. Our work builds on prior work highlighting improvements in repeatability of model predictions made by certain design choices 36,37. Our work also stands out among the paucity of current approaches that have utilized AI and DL for cervical screening [21][22][23][24]; as aforementioned, these are largely plagued by overfitting and no consideration of repeatability. The dearth of work investigating repeatability of AI models designed for clinical translation in the current DL and medical image classification literature has meant that no rigorous study, to the best of our knowledge, has employed repeatability as a model selection criterion. We posit that our work could motivate further efforts to include repeatability as a key criterion for clinical AI model design.

Table 3. Classification performance analysis. Classification performance analysis on "Model Selection Set"/"Test Set 1", highlighting Youden's index (YI) and extreme misclassification statistics (median with interquartile range (IQR) and adjusted linear regression (LR) β values) for design choices within each design choice category for our automated visual evaluation (AVE) classifier, after filtering for repeatability.

Subsequent design choices of our work are optimized to improve clinical translatability. Prior work [21][22][23][24] has shown us that while binary classifiers for cervical image-based cervical precancer+ detection can achieve competitive performance in a given internal seed dataset, they translate poorly when tested in different settings; uncertain cases can be misclassified, and predictions tend to oscillate between the two classes. This oscillation phenomenon could prevent a precancer+ woman from accessing further evaluation (i.e., false negative) or direct a normal woman through unnecessary, potentially invasive tests (i.e., false positive). False negatives are especially problematic in LMIC where screening is limited and represent a missed opportunity to detect and treat precancer via excisional, ablative, or surgical methods, in order to avert cervical cancer 13,38. We further assess the importance of our multi-class approach and incorporation of MC dropout by highlighting the comparison between binary and three-class models, with and without MC dropout, in terms of key classification and repeatability metrics on "Test Set 2" in Table 6. Table 6 highlights that three-class models perform better than binary models in terms of both repeatability and classification metrics, while MC dropout improves repeatability. This is conceptually justified since a three-level ground truth with a quadratic weighted kappa loss function that penalizes misclassification between the boundary classes is designed to limit extreme classifications; we find this to be true in our case. Furthermore, MC dropout is a model regularization technique known to prevent overfitting, and we find that it also improves repeatability 36. By incorporating a multi-class approach and a loss function that heavily penalizes extreme misclassifications, we improve reliability of the model-predicted normal and precancer+ categories, and further ensure that women ascribed to the intermediate classes are recommended for additional clinical evaluation.
Finally, our assessment of model performance was based on its ability to stratify precancer+ risk within each of the four risk-based HPV groupings (Stage III of our model selection approach, as described in "Methods"). For our model to successfully be used in a triage setting, it must do more than mimic the risk stratification of HPV groupings: it must order risk within each HPV-type group correctly. Given the high negative predictive value of HPV, we believe that our model can act as an effective triage tool for HPV positive women.
Our prior work has informed us that the HPV positive women in the NHS subset better represent a typical screening population: specifically, the NHS subset represents women who tested HPV-positive in any given population with an intermediate HPV prevalence 35. The other 4 subsets within the full 5-study dataset comprise women referred from HPV-based/cytology-based referral clinics: this represents a colposcopy population, which has a higher disease prevalence. We optimize each stage (I, II and III) of our model selection approach on the full 5-study dataset to better capture the variability in cervical appearance on imaging. At the end of this selection, we find that our top models do not perform meaningfully differently among HPV positive women in the full 5-study dataset, highlighted by similar HPV-AVE AUC values across the models in the "HPV positive 5 study" column in Table 4. For the final selection of the top candidates, given our goal of using AVE as a triage tool for HPV positive women in a screening setting, we therefore narrow our focus to the combined HPV-AVE AUC in the NHS HPV positive subset ("HPV positive NHS" column in Table 4; Fig. 4) for each model on the "Model Selection Set"/"Test Set 1" and confirm performance of the top candidates on an additional held-aside test set, "Test Set 2" (see "Methods", Table 5 and Fig. 5a).
Despite the multi-institutional, multi-device and multi-population nature of our final, collated dataset; the use of multiple held-aside test sets; and the exhaustive search space utilized for our algorithm choices, our work may be limited by sparse external validation. Forthcoming work will evaluate our model selection choices on several additional external datasets, assessing out-of-the-box performance as well as various transfer learning, retraining and generalization approaches. Future work will additionally optimize our final model choice for use on edge devices, thereby promoting deployability and translation in LMIC.
In this work, we utilized a large, multi-institutional, multi-device and multi-population dataset of 9,462 women (17,013 images) as a seed and implemented a comprehensive model selection approach to generate a diagnostic classifier, termed AVE, able to classify images of the cervix into "normal", "gray zone" and "precancer+" categories. Our model selection approach investigates various choices of model architecture, loss function, balancing strategy, dropout, and ground truth mapping, and optimizes for (1) improved repeatability; (2) classification performance; and (3) high-risk HPV-type-group combined risk-stratification. Our best performing model uniquely (1) alleviates overfitting by incorporating spatial MC dropout to regularize the learning process; (2) achieves strong repeatability of predicted class across repeat images from the same woman; (3) addresses rater and model uncertainty with ambiguous cases by utilizing a three-level ground truth and QWK as the loss function to penalize extreme (between boundary class) misclassifications; and (4) achieves a strong additional risk-stratification when combined with the corresponding HPV type group within our screening population of interest. While our initial goal is to implement AVE primarily to triage HPV positive women in a screening setting, we expect our approach and selected model to also provide reliable predictions for images obtained in the colposcopy setting. Our model selection approach is generalizable to other clinical domains as well: we hope for our work to foster additional, carefully designed studies that focus on alleviating overfitting and improving reliability of model predictions, in addition to optimizing for improved classification performance, when deciding to use an AI approach for a given clinical task.

Table 5. Classification and Repeatability results on Test Set 2 for top performing models. Classification and repeatability results on "Test Set 2" for top 10 best performing models, highlighting % precancer+ as normal (% p as n) and % normal as precancer+ (% n as p), the % 2-class disagreement between image pairs across women (% 2-Cl. D.), and the quadratic weighted kappa (QWK) values on the discrete class outcomes for paired images across women, for each model. EM: extreme misclassifications.

Overview
This study set out to systematically compare the impact of multiple design choices on the ability of a deep neural network (DNN) to classify cervical images into delineated cervical cancer risk categories. We combined images of the cervix from five studies (Supp. Table 1) into a large convenience sample for analysis. We subsequently labelled the images using three distinct multi-level ground truth labelling approaches: (1) a 5-level map, which included normal, gray-low (GL), gray-middle (GM), gray-high (GH), and precancer+ (termed "5 level all patients"); (2) a 3-level map which combined the intermediate three labels (GL, GM, GH) into one single gray zone (termed "3 level all patients"); and (3) an additional 3-level map which excluded the GL and GH labels, and considered only the normal, GM and precancer+ labels (termed "3 level subsets"). The choice of multi-level ground truth labelling for model selection was motivated by our previous work and intuition revealing the failure of binary models, as well as our specific clinical use case. Table 1 highlights the population level and dataset level characteristics for our final, collated dataset used for training and evaluation, highlighting the distribution of histology, cytology, HPV types, population-level study, age, and number of images per patient within each of the five ground truth classes. We subsequently identified four key design decision categories that were systematically implemented, intersected, and compared. These included: model architecture, loss function, balancing strategy, and implementation of dropout, as highlighted in Fig. 1. The choice of balancing strategy for a particular model determined the ratios of randomly chosen train and validation sets used during training. We subsequently trained multiple classifiers using combinations of these design choices and generated predictions on a common test set ("Model Selection Set"/"Test Set 1") which was used to compare and rank models based on repeatability, classification performance, and HPV type-group combined risk stratification. Finally, we confirmed the performance of the top models on a second held-aside test set ("Test Set 2") to mitigate the impact of chance on the best performing approaches.
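The three ground truth mappings described above can be expressed as simple label transformations. The sketch below is illustrative only: the string labels and function names are hypothetical, and the integer class indices assume the ordering normal = 0, gray zone (or GM) = 1, precancer+ = 2 that the text uses when referring to "class 0" and "class 2".

```python
FIVE_LEVEL = ["normal", "GL", "GM", "GH", "precancer+"]
GRAY_LABELS = {"GL", "GM", "GH"}

def three_level_all_patients(label):
    """'3 level all patients': merge GL, GM and GH into one gray zone class."""
    if label == "normal":
        return 0
    if label in GRAY_LABELS:
        return 1
    if label == "precancer+":
        return 2
    raise ValueError(f"unknown label: {label}")

def three_level_subsets(labels):
    """'3 level subsets': drop GL and GH, keep normal, GM and precancer+."""
    index = {"normal": 0, "GM": 1, "precancer+": 2}
    return [index[lab] for lab in labels if lab in index]

all_patients = [three_level_all_patients(lab) for lab in FIVE_LEVEL]
subsets = three_level_subsets(FIVE_LEVEL)
```

The "all patients" map keeps every image, whereas the "subsets" map discards images labelled GL or GH, which is why it yields a smaller training set.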

Dataset
Included studies
Cervical images used in this analysis were collected from five separate study populations labelled NHS, ALTS, CVT, Biop and D Biop (Table 1; Fig. 1). Detailed descriptions for each study can be found in the supplementary methods section. The final dataset was collated into a large convenience sample comprising a total of 17,013 images from 9,462 women.

Table 6. Classification and Repeatability metrics comparing binary with multiclass models, both with and without Monte Carlo (MC) dropout. Comparison of binary and multiclass models on "Test Set 2", highlighting relevant classification metrics (% p as n: % precancer+ as normal; % n as p: % normal as precancer+; and % ext. mis.: % extreme misclassifications) and repeatability metrics (% ext. dis.: % extreme disagreement, i.e. extreme disagreement between image pairs across women; QWK: quadratic weighted kappa; and 95% LoA: 95% limits of agreement on a Bland-Altman plot, highlighting the continuous score repeatability). All four models (binary, binary with Monte-Carlo (MC) dropout, three-class and three-class with MC dropout) incorporate the same configurations as the top performing model (#36), with the only exceptions being the presence or absence of MC dropout and whether the models output binary or three-class predictions (as indicated by the corresponding name). All three-class models were trained using the "3 level all patients" ground truth mapping (normal, gray zone, precancer+), while the binary models were trained on binary (normal, precancer+) ground truths. The metrics highlighted here indicate that three-class models perform better than binary models in terms of both repeatability and classification metrics, while MC dropout improves repeatability.

Here, ω is the weight matrix for quadratic penalization for every pair i, j, C is the number of classes, O is the confusion matrix represented by the matrix multiplication between the true value and prediction vectors, and E is the outer product between the true value and prediction vectors.
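The equation these symbols refer to did not survive in this excerpt. As a hedged illustration only, the sketch below computes a differentiable-style ("soft") quadratic weighted kappa from the quantities just defined, using the standard formulation κ = 1 − Σ ω·O / Σ ω·E with ω_ij = (i − j)² / (C − 1)². Whether the authors minimize 1 − κ, a logarithmic variant, or another form is not recoverable here; the function names are hypothetical.

```python
import numpy as np

def quadratic_weights(num_classes):
    """omega[i, j] = (i - j)^2 / (C - 1)^2, the quadratic penalty matrix."""
    i, j = np.meshgrid(np.arange(num_classes), np.arange(num_classes),
                       indexing="ij")
    return (i - j) ** 2 / (num_classes - 1) ** 2

def soft_qwk(y_true, probs):
    """Quadratic weighted kappa from labels and predicted probabilities.

    y_true: (N,) integer class labels; probs: (N, C) predicted probabilities.
    O is the one-hot truth matrix multiplied with the prediction matrix;
    E is the outer product of the truth and prediction class totals,
    scaled by N so that O and E have the same overall sum.
    """
    probs = np.asarray(probs, dtype=float)
    n, num_classes = probs.shape
    onehot = np.eye(num_classes)[np.asarray(y_true)]
    observed = onehot.T @ probs
    expected = np.outer(onehot.sum(axis=0), probs.sum(axis=0)) / n
    omega = quadratic_weights(num_classes)
    return 1.0 - (omega * observed).sum() / (omega * expected).sum()

# Perfect agreement gives kappa = 1; one common loss choice is 1 - kappa.
kappa_perfect = soft_qwk([0, 1, 2], np.eye(3))
```

Because the penalty grows with the squared distance between classes, confusing normal with precancer+ costs four times as much as confusing either with the gray zone in the three-class setting.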
Here σ is the sigmoid function, ŷ is the model's output, and y is the level-encoded ground truth.

Three balancing strategies were evaluated to deal with the dataset's class imbalance: weighting the loss function, modifying the loading sampler, and rebalancing the training and validation sets. These strategies were only applied during the training process and were compared against training without balancing. To emphasize the least frequent labels, one approach was to apply weights to the loss function in proportion to the inverse of the occurrence of each class label. A second approach was to reweight the loading sampler to present images associated with each label equally as well as with specific weights: 2:1:1, 1:1:2, or 1:1:4 (Normal : Gray Zone : Precancer+). The final balancing strategy, henceforth termed "remove controls", involved randomly removing "normal" (class 0) women from the training and validation sets and reallocating them to "Model Selection Set"/"Test Set 1", in order to better rebalance the training and validation set labels; in this approach, a total of 2,383 women (4,555 images) from the initial train set, and 410 women (780 images) from the initial validation set were reallocated to the test set. The final class balance in the train and validation sets for the "remove controls" balancing strategy amounted to ~40% normal : 40% gray zone (including GL, GM, and GH) : 20% precancer+ (Supp. Table 3).
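The first two balancing strategies can be sketched numerically as below. This is one possible reading, not the authors' implementation: loss weights are taken as normalized inverse class frequencies, "balanced" sampling gives each class equal total probability, and preassigned label weights (e.g. 1:1:4) are applied per image without adjusting for class counts. All function names are hypothetical, and every class is assumed to appear at least once.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Loss weights proportional to 1 / class count, normalized to sum to 1."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    weights = 1.0 / counts
    return weights / weights.sum()

def balanced_sampling_probs(labels, num_classes):
    """Per-image draw probabilities so each class is drawn equally often."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    per_image = 1.0 / counts[labels]
    return per_image / per_image.sum()

def preassigned_sampling_probs(labels, label_weights):
    """Per-image draw probabilities from fixed label weights such as 1:1:4.

    The weights are NOT divided by class counts, so the resulting class mix
    still depends on the dataset's own imbalance.
    """
    per_image = np.asarray(label_weights, dtype=float)[np.asarray(labels)]
    return per_image / per_image.sum()

# Toy labels: 6 normal, 3 gray zone, 1 precancer+ (illustrative only).
toy_labels = np.array([0] * 6 + [1] * 3 + [2] * 1)
loss_w = inverse_frequency_weights(toy_labels, 3)
probs_114 = preassigned_sampling_probs(toy_labels, (1, 1, 4))
share_precancer = probs_114[toy_labels == 2].sum()
```

In this toy case the 1:1:4 weights give precancer+ a draw share of 4/13 rather than the one third that balanced sampling would give, which illustrates how fixed label weights that ignore class counts can leave the class mix far from the intended one.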
Finally, we evaluated multiple approaches to dropout during training, which alleviates overfitting and regularizes the learning process by randomly removing neural connections from the model 47. Spatial dropout drops entire feature maps during training: a rate of 0.1 was applied after each dense layer for the DenseNet models, and after each residual block for the ResNet and ResNeSt models. The Swin Transformer models were used as implemented in 43. Monte Carlo (MC) dropout was additionally implemented; it can be thought of as a Bayesian approximation 48 generated by enabling dropout during inference and averaging 50 MC samples. MC models in this work refer to models trained using dropout, with the inference prediction derived from the 50 forward passes. Additionally, we conducted 20 repeats of individual model runs and plotted histograms of the standard deviation of the model predicted continuous score and class at the image level (Fig. 8). The variability between repeats is negligible, as shown in Fig. 8.
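The idea behind MC dropout inference can be sketched with a toy model. The example below is an assumption-laden illustration in plain Python. It uses a single linear layer with element-wise (not spatial) dropout, and all names and weights are hypothetical. Only the rate of 0.1 and the averaging over 50 stochastic forward passes come from the text.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stochastic_forward(features, weights, drop_rate, rng):
    """One forward pass of a toy linear classifier with dropout left enabled."""
    keep = 1.0 - drop_rate
    # Inverted dropout: zero each feature with probability drop_rate and
    # rescale the survivors so the expected activation is unchanged.
    dropped = [x / keep if rng.random() < keep else 0.0 for x in features]
    logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
    return softmax(logits)

def mc_dropout_predict(features, weights, drop_rate=0.1, n_samples=50, seed=0):
    """Average the class probabilities over n_samples stochastic passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(features, weights, drop_rate, rng)
               for _ in range(n_samples)]
    n_classes = len(weights)
    return [sum(s[c] for s in samples) / n_samples for c in range(n_classes)]
```

In a real network the same pattern applies: dropout layers stay active at inference, the image is passed through the model 50 times, and the softmax outputs are averaged before the class or continuous score is derived.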

Statistical analysis
Our model selection approach (Fig. 2) consisted of three stages, each utilizing model predictions from the "Model Selection Set"/"Test Set 1". After selection of the 10 best models following Stage III, we further evaluated their performance in "Test Set 2" to confirm results from the "Model Selection Set"/"Test Set 1".
In Stage I of our model selection approach, we evaluated models based on their ability to classify pairs of cervical images reliably and repeatedly, termed the repeatability analysis. We calculated the QWK values on the discrete class outcomes for paired images from the same woman and visit for all models, calculating the mean, median, and interquartile range of the QWK for each design choice. We subsequently ran an adjusted multivariate linear regression of the median QWK vs. the various design choice categories and computed the β values and corresponding p-values for each design choice, holding the design choice with the highest median QWK within each design choice category as reference. This allowed us to gauge the relative impacts of the various design choices within each of the model architecture, loss function, balancing strategy, dropout, and ground truth categories.

Preliminary experiments investigating various values for the α_t and γ parameters in the focal loss equation, highlighting the rationale behind optimized values of α_t = 0.25 and γ = 2, which were also reported as optimized values in Lin et al. 44. Here, we iterated across α_t = 0.25, 1, and inverse class frequency ("weights") and γ = 1.5, 2, 3 and 4. Both (a) and (b) illustrate Bland-Altman plots (top panel) and continuous score histograms (bottom panel), highlighting both repeatability and relative class discrimination across the various parameter choices. In (a), γ is held constant, and α_t (0.25, inverse class frequency) and the method of reduction (mean, sum) are iterated. In (b), α_t and the method of reduction are held constant, while γ (1.5, 2, 3, 4) is iterated. Overall, the results indicate that increasing γ leads to improved repeatability (as indicated by the narrower 95% limits of agreement (LoA) on the Bland Altman plot) but slightly poorer class discrimination (as indicated by the narrower score range in both the Bland Altman plot and the histogram); changing α_t and/or the method of reduction has relatively less effect on repeatability and class discrimination. The best overall balance between the two is achieved with α_t = 0.25 and γ = 2, consistent with Lin et al. 44.
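For reference, the roles of α_t and γ in the focal loss can be seen in a small sketch of the standard form from Lin et al., FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The function names and the constant (class-independent) α_t are simplifying assumptions. The study's exact multiclass implementation is not given in this section.

```python
import math

def focal_loss(probs, target, alpha_t=0.25, gamma=2.0):
    """Focal loss for one sample.

    probs  : predicted class probabilities (e.g. softmax output)
    target : index of the true class
    """
    p_t = probs[target]
    # (1 - p_t) ** gamma down-weights samples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def batch_focal_loss(batch_probs, targets, alpha_t=0.25, gamma=2.0,
                     reduction="mean"):
    """Reduce per-sample focal losses by 'mean' or 'sum'."""
    losses = [focal_loss(p, t, alpha_t, gamma)
              for p, t in zip(batch_probs, targets)]
    total = sum(losses)
    return total / len(losses) if reduction == "mean" else total
```

With γ = 0 and α_t = 1 the expression reduces to ordinary cross-entropy. Larger γ shrinks the contribution of confidently correct predictions.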
In Stage II of our approach, we evaluated classification performance based on two key metrics: (1) Youden's index, which captures the overall sensitivity and specificity, and (2) the degree of extreme misclassifications; this is termed the classification performance analysis. We computed both sets of metrics for each of the design choices within each design choice category. Our choice to include misclassification of the extreme classes (i.e., precancer+ classified as normal, or extreme false negative, and normal classified as precancer+, or extreme false positive) as metrics was motivated by the importance of these metrics for triage tests 49. Similar to the repeatability analysis, we calculated the mean, median, and interquartile ranges for these metrics, and conducted separate multivariate linear regressions of each of the three median statistics vs. the various design choice categories; we computed the β values and corresponding p-values holding the design choice with the lowest median Youden's index within each design choice category as reference. This allowed for comparison across design choices overall and within each design choice category.
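The Stage II metrics can be illustrated with a short sketch. Two assumptions are made here that the text does not confirm. Youden's index is computed by treating precancer+ (class 2) as the positive class against all other classes. Each extreme misclassification rate is expressed as a percentage of the women in the corresponding true class.

```python
def youden_index(y_true, y_pred, positive=2):
    """Youden's J = sensitivity + specificity - 1 for `positive` vs. rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0

def extreme_misclassification_rates(y_true, y_pred, normal=0, precancer=2):
    """Return (% precancer+ as normal, % normal as precancer+)."""
    pairs = list(zip(y_true, y_pred))
    n_precancer = sum(1 for t, _ in pairs if t == precancer)
    n_normal = sum(1 for t, _ in pairs if t == normal)
    # Extreme false negatives: true precancer+ predicted as normal.
    p_as_n = sum(1 for t, p in pairs if t == precancer and p == normal)
    # Extreme false positives: true normal predicted as precancer+.
    n_as_p = sum(1 for t, p in pairs if t == normal and p == precancer)
    return 100.0 * p_as_n / n_precancer, 100.0 * n_as_p / n_normal
```

Gray zone predictions count as neither extreme error, which is why a three-class model can have a low extreme misclassification rate even when its overall accuracy is moderate.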
In Stage III of our model selection approach, we selected the best individual models determined by their ability to further stratify the risk of precancer associated with each of four groups of oncogenic high-risk HPV types. HPV screening is known to have an extremely high negative predictive value 50,51, and our approach was motivated by the goal of designing an algorithm to triage HPV positive primary screening. The HPV types were grouped hierarchically in four groupings, in order of decreasing risk 52: (1) HPV 16; (2) HPV 18 or 45; (3) HPV 31, 33, 35, 52, 58; and (4) HPV 39, 51, 56, 59, 68. In order to assess the ability of a model to further stratify HPV associated risk, we ran logistic regression models on a binary precancer+ vs. <precancer variable. These models were adjusted for hierarchical HPV type group and the model predicted class. We subsequently calculated the difference in AUC between the model adjusted for both predicted class and HPV type group and the model adjusted only for HPV type group, and highlighted the 10 models with the best additional stratification (Table 4, Fig. 4).
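The AUC difference used in Stage III can be sketched as below. The example assumes that fitted risks from the two logistic regression models (HPV type group only, and HPV type group plus predicted class) are already available. The fitting step is omitted, and the function names and toy values are hypothetical.

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney formulation.

    Equals the probability that a randomly chosen positive (label 1) has a
    higher score than a randomly chosen negative (label 0); ties count 0.5.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def added_auc(labels, risk_hpv_only, risk_combined):
    """Combined (HPV + classifier) AUC minus HPV-only AUC."""
    return auc(labels, risk_combined) - auc(labels, risk_hpv_only)

# Toy example (hypothetical values): 1 = precancer+, 0 = <precancer.
toy_labels = [0, 0, 1, 1]
toy_hpv_only = [0.2, 0.6, 0.4, 0.8]
toy_combined = [0.1, 0.3, 0.5, 0.9]
toy_gain = added_auc(toy_labels, toy_hpv_only, toy_combined)
```

A positive difference indicates that the classifier's predicted class adds risk stratification beyond HPV type group alone.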
Finally, we computed additional classification performance metrics (1. % precancer+ as normal; and 2. % normal as precancer+), and repeatability metrics (1. the % 2-class disagreement between image pairs; and 2. QWK values, on the discrete class outcomes for paired images across women) for each of the top 10 models on "Test Set 2" (Table 5, Fig. 5), in order to further confirm the performance of these models.

Additionally, to aid visualization of predictions at the individual model level, we generated Fig. 6, which compares model predictions across 60 images for each of the top 10 models. To generate this comparison, we first summarized each model's output as a continuous severity score. Specifically, we utilized the ordinality of our problem and defined the continuous severity score as a weighted average using the softmax probability of each class, as described in Eq. (3), where k is the number of classes and p_i is the softmax probability of class i.

$$\text{score} = \sum_{i=0}^{k-1} i \, p_i \qquad (3)$$

Put another way, the score is equivalent to the expected value of a random variable that takes values equal to the class labels, where the probabilities are the model's softmax probability at index i corresponding to class label i. For a three-class model, the values lie in the range 0 to 2. We next computed the average of the score for each image across all 10 models and arranged the images in order of increasing score within each class. From this score-ordered list, we randomly selected 20 images per class, maintaining the distribution of mean scores within each class, and arranged the images in order of increasing average score within each class in the top row of Fig. 6, color coded by ground truth. We subsequently compared the predicted class across the 10 models for each of these 60 images (bottom 10 rows of Fig. 6), maintaining the images in the same order as the ground truth row and color coded by model predicted class. This enabled us to gain deeper insight and to compare model performance at the individual image level.
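The continuous severity score described above, the expected class label under the model's softmax output, is simple enough to sketch directly. The function names and the helper for averaging across models are illustrative assumptions.

```python
def severity_score(probs):
    """Expected class label: sum over classes of (label i) * (probability p_i)."""
    return sum(i * p for i, p in enumerate(probs))

def mean_score_across_models(per_model_probs):
    """Average severity score of one image over several models.

    per_model_probs: list of softmax vectors, one per model, for the same image.
    """
    scores = [severity_score(p) for p in per_model_probs]
    return sum(scores) / len(scores)

# For a three-class model the score lies between 0 (certainly normal)
# and 2 (certainly precancer+).
example = severity_score([0.2, 0.5, 0.3])  # 0*0.2 + 1*0.5 + 2*0.3
```

Sorting images by the across-model mean of this score within each ground truth class gives the ordering described for the top row of the comparison figure.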

Figure 1. Model selection and optimization overview. The top panel highlights the five different studies (NHS, ALTS, CVT, Biop and D Biop; see Table 1, Supp. Table 1, and Supp. Methods for detailed description and breakdown of the studies by ground truth) used to generate the final dataset on the middle panel, which is subsequently used to generate a train and validation set, as well as two separate test sets. The intersections of model selection choices on the bottom panel are used to generate a compendium of models trained using the corresponding train and validation sets and evaluated on the "Model Selection Set"/"Test Set 1", optimizing for repeatability, classification performance, reduced extreme misclassifications and combined risk-stratification with high-risk human papillomavirus (HPV) types. "Test Set 2" is utilized to verify the performance of top candidates that emerge from evaluation on the "Model Selection Set"/"Test Set 1". SWT: Swin Transformer; QWK: quadratic weighted kappa; CORAL: CORAL (consistent rank logits) loss, as described in the "Methods" section.

Figure 4 and Table 4 highlight the 10 best performing models that emerge following Stages I, II and III of our model selection approach. All 10 models perform similarly among HPV positive women in the full 5-study set, while showing notable differences per study as shown in the NHS subset of the full 5-study set, measured by the

Figure 2. Model selection approach and statistical analysis utilized in our automated visual evaluation (AVE) classifier.IQR: interquartile range; AUC: area under the receiver operating characteristics (ROC) curve; CI: confidence interval.

Figure 3. (a) Median quadratic weighted kappa (QWK) and adjusted linear regression (LR) β across the various design choices, as part of the repeatability analysis. (b) Median Youden's index, median % precancer+ as normal (% p as n) and median % normal as precancer+ (% n as p), with the corresponding adjusted LR β values across the various design choices (after filtering for repeatability), as part of the classification performance analysis. Muted bars indicate design choices dropped at each stage. All results are from the "Model Selection Set"/"Test Set 1". SWT: Swin Transformer; CORAL: CORAL (consistent rank logits) loss, as described in the "Methods" section; ref: reference category.

Figure 4. (a) Difference between HPV + AVE combined AUC and HPV-only AUC in the HPV positive NHS subset for the top 10 models on the "Model Selection Set"/"Test Set 1". (b) Receiver operating characteristics (ROC) curves for each of the top 4 best performing models in the HPV positive NHS subset of the full dataset. The plotted lines indicate (1) HPV AUC, (2) AVE AUC and (3) combined HPV-AVE AUC, for models (i) 36, (ii) 65, (iii) 34, and (iv) 81. HPV: human papillomavirus; AVE: automated visual evaluation, which refers to the classifier; AUC: area under the ROC curve.

Figure 5. (a) Classification and repeatability results on "Test Set 2" for the top 10 best performing models, highlighting the % precancer+ as normal (% p as n) and % normal as precancer+ (% n as p) (left), the % 2-class disagreement between image pairs across women (middle), and the quadratic weighted kappa (QWK) values on the discrete class outcomes for paired images across women (right) for each model. (b) Representative plots for the top performing model (#36) on "Test Set 2": (i) receiver operating characteristics (ROC) curves for the normal vs. rest (Class 0 vs. rest) and precancer+ vs. rest (Class 2 vs. rest) cases, (ii) confusion matrix, (iii) histogram of model predicted continuous score, color coded by ground truth, and (iv) Bland Altman plot of model predictions, color coded by ground truth: each point on this plot refers to a single woman, with the y-axis representing the maximum difference in the score across repeat images per woman, and the x-axis plotting the mean of the corresponding score across all repeat images per woman.

Figure 6. Model level comparison across the top 10 best performing models on "Test Set 2". 60 images were randomly selected from "Test Set 2" (see "Methods": "Statistical analysis" section) and arranged in order of increasing mean score within each ground truth class in the top row (labelled "Ground Truth"). The model predicted class for the top 10 models for each of these 60 images is highlighted in the bottom rows, where the images follow the same order as the top row. The color coding in the top row represents ground truth, while in the bottom 10 rows it represents the model predicted class. Green: Normal, Gray: Gray Zone, and Red: Precancer+, as highlighted in the legend. Each image corresponds to a different woman.

Figure 8. Histograms highlighting the distribution of standard deviations of the model continuous score (top) and model predicted class (bottom) at the image level across 20 runs, for each of two representative models: (a) model #36 and (b) model #77. For both models (a) and (b), model predictions are derived from "Model Selection Set"/"Test Set 1" (left) and "Test Set 2" (right), respectively. These results indicate that model predictions are consistent across repeat runs, within each model configuration and test set; this is highlighted by the large density of standard deviations of the model predicted class at the image level near 0 (meaning that for a given model configuration, the predicted class of an image remains relatively constant across repeat runs) and the small maximum standard deviation of around 0.08 to 0.1 (meaning that the model predicted continuous score of an image also changes minimally across repeat runs, and not enough to propagate to a change in predicted class).

Table 1. Baseline characteristics of women in each of the ground truth categories, highlighting proportions by histology, cytology, human papillomavirus (HPV) type, study, as well as age and # images/woman. The detailed study descriptions and ground truth assignment by study can be found in Supp. Table 1 and in the Supp. Methods section. CIN: cervical intraepithelial neoplasia; AIS: adenocarcinoma in situ; ASC-H: atypical squamous cells, cannot rule out high grade squamous intraepithelial lesion; HSIL: high-grade squamous intraepithelial lesion; LSIL: low-grade squamous intraepithelial lesion; ASCUS: atypical squamous cells of undetermined significance; SD: standard deviation; IQR: interquartile range.
Characteristics / Ground truth categories, no. (%): Normal (N = 6092); Gray low (N = 867); Gray middle (N = 918); Gray high (N = 529); Precancer+ (N = 1056)

Table 2. Repeatability analysis. Repeatability analysis on "Model Selection Set"/"Test Set 1", highlighting quadratic weighted kappa (QWK) summary statistics (mean, median with interquartile range (IQR), and adjusted linear regression (LR) β values) for design choices within each design choice category for our automated visual evaluation (AVE) classifier. Rows in bold indicate design choices filtered out at this stage due to poor repeatability. SWT: Swin Transformer; CORAL: CORAL (consistent rank logits) loss, as described in the "Methods" section; ref: reference category. ** indicates significance at the 0.05 level.

Table 2). Rows in bold indicate design choices filtered out at this stage due to poor classification performance (as captured by the Youden's index). Rows in italics indicate design choices subsequently filtered out due to a combination of poor classification performance (as captured by the rate of extreme misclassifications) and/or practical reasons. SWT: Swin Transformer; ref: reference category. ** indicates significance at the 0.05 level.

Table 4. Selection of top individual models with best additional risk stratification. Performance of top individual models following human papillomavirus (HPV) group combined risk stratification (Stage III of model selection) on "Model Selection Set"/"Test Set 1", within the HPV-positive full dataset and HPV-positive NHS subset. The models are in decreasing order of area under the receiver operating characteristics (ROC) curve (AUC) on the HPV-positive NHS subset of the full dataset. AVE: automated visual evaluation, which refers to the classifier; CI: confidence interval. a Difference = Combined HPV + AVE AUC minus HPV-only AUC.